Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join from an existing snapshot #1532

Merged
merged 69 commits into from
Aug 27, 2020
Merged

Conversation

jumaffre
Copy link
Contributor

@jumaffre jumaffre commented Aug 25, 2020

The PR that puts all the previous snapshot PRs together.

It is now possible for a node to join a network from an existing snapshot, making the catch-up phase much quicker (basically, only replaying the entries from the last snapshot to the last index).

To do so, the operator(s) of the CCF network are responsible for copying the snapshot file(s) produced by one node (under the snapshots/ folder) to the directory of the new joiner. The new joiner will automatically try to join from the latest (based on the snapshot.idx file name) snapshot, deserialising the snapshot in its store when joining (*).

Raft will then be forced to become follower at a specific commit index. The snapshotter and the ledger will also have to be initialised at that index so that the ledger and snapshots on the joiner node are the same than those on other nodes in the network.

I've also added a new snapshot test suite to test this more thoroughly + all the necessary changes to the Python infra to get the snapshots from one node copied over to the new node.

(*) The snapshot can only be deserialised after the ledger secrets have been passed in by the network so that the snapshot can be decrypted.

Next:

  • More thorough assertions on the content of ledger/snapshot files during end-to-end testing
  • Joiner only becomes part of network once it has seen the signed snapshot evidence

@@ -571,13 +588,14 @@ if(BUILD_TESTS)
NAME reconfiguration_test
PYTHON_SCRIPT ${CMAKE_SOURCE_DIR}/tests/reconfiguration.py
CONSENSUS raft
ADDITIONAL_ARGS --raft-election-timeout 4000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lowering the raft election timeout here (default is 100s) as the reconfiguration test now also waits for a new election to complete.

auto seqno = network.tables->current_version();
consensus->force_become_backup(seqno, sig->view);

reset_data(config.joining.snapshot);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the snapshot has been applied, we don't want to keep it around in enclave memory.

@ccf-bot
Copy link
Collaborator

ccf-bot commented Aug 26, 2020

join_from_snapshot@11998 aka 20200827.8 vs master ewma over 50 builds from 11569 to 11985
images

src/host/main.cpp Outdated Show resolved Hide resolved
src/host/snapshot.h Outdated Show resolved Hide resolved

reset_data(config.joining.snapshot);
LOG_INFO_FMT(
"Joiner successfully resumed from snapshot at seqno {} and "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds a slightly more final than it really is: we've started resuming from this snapshot, but until we see the evidence from the primary, we don't know if the snapshot is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. That's the next step for this feature so I will address that in the next PR.

src/node/snapshotter.h Outdated Show resolved Hide resolved
Copy link
Member

@achamayou achamayou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor comments.

jumaffre and others added 6 commits August 27, 2020 09:37
@jumaffre jumaffre merged commit e42b511 into microsoft:master Aug 27, 2020
@achamayou achamayou mentioned this pull request Aug 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants